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Abstract 

Given the superposition of a low-rank matrix plus the product of a known fat compression matrix 
times a sparse matrix, the goal of this paper is to establish deterministic conditions under which exact 
recovery of the low-rank and sparse components becomes possible. This fundamental identifiability issue 
arises with traffic anomaly detection in backbone networks, and subsumes compressed sensing as well 
as the timely low-rank plus sparse matrix recovery tasks encountered in matrix decomposition problems. 
Leveraging the ability of l\- and nuclear norms to recover sparse and low-rank matrices, a convex program 
is formulated to estimate the unknowns. Analysis and simulations confirm that the said convex program can 
recover the unknowns for sufficiently low-rank and sparse enough components, along with a compression 
matrix possessing an isometry property when restricted to operate on sparse vectors. When the low-rank, 
sparse, and compression matrices are drawn from certain random ensembles, it is established that exact 
recovery is possible with high probability. First-order algorithms are developed to solve the nonsmooth 
convex optimization problem with provable iteration complexity guarantees. Insightful tests with synthetic 
and real network data corroborate the effectiveness of the novel approach in unveiling traffic anomalies 
across flows and time, and its ability to outperform existing alternatives. 
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I. Introduction 

Let X 6 R LxT be a low-rank matrix [r := rank(X ) <C min(L,T)], and let A G R FxT be sparse 
(s := ||A ||o <C FT, || • ||o counts the nonzero entries of its matrix argument). Given a compression matrix 
RGl IxF with L < F, and observations 

Y = X + RA (1) 

the present paper deals with the recovery of {Xo, Ao}. This task is of interest e.g., to unveil anomalous 
flows in backbone networks [23], [25], [39], to extract the time- varying foreground from a sequence of 
compressed video frames [37], or, to identify active brain regions from undersampled functional magnetic 
resonance imagery (fMRI) [15]. In addition, this fundamental problem is found at the crossroads of 
compressive sampling (CS), and the timely low-rank-plus-sparse matrix decompositions. 

In the absence of the low -rank component (Xo = Olxt), one is left with an under-determined sparse 
signal recovery problem; see e.g., [12], [31] and the tutorial account [13]. When Y = Xo + Ao, 
the formulation boils down to principal components pursuit (PCP), also referred to as robust principal 
component analysis (PCA) [10], [14], [18]. For this idealized noise-free setting, sufficient conditions for 
exact recovery are available for both of the aforementioned special cases. However, the superposition 
of a low-rank and a compressed sparse matrix in (1) further challenges identifiability of {Xo, Ao}. In 
the presence of 'dense' noise, stable reconstruction of the low-rank and sparse matrix components is 
possible via PCP [38], [40]. Earlier efforts dealing with the recovery of sparse vectors in noise led to 
similar performance guarantees; see e.g., [5] and references therein. Even when Xo is nonzero, one could 
envision a CS variant where the measurements are corrupted with correlated (low -rank) noise [15]. Last 
but not least, when Ao = Ofxt and Y is noisy, the recovery of Xo subject to a rank constraint is nothing 
else than PCA - arguably, the workhorse of high-dimensional data analysis [22]. 

The main contribution of this paper is to establish that given Y and R in (1), for small enough r and 
s one can exactly recover {Xo, Ao} by solving the nonsmooth convex optimization problem 

(PI) min IIXL + A||A||i 

{X,A} 

s.to Y = X + RA 

where A > is a tuning parameter; ||X||* := J2i a iP^-) * s tne nuclear norm of X (a-i stands for 
the i-th singular value); and, ||X||i := X^jl^jl denotes the ^i-norm. The aforementioned norms are 
convex surrogates to the rank and ^o- norm > respectively, which albeit natural as criteria they are NP-hard 
to optimize [16], [28]. Recently, a greedy algorithm for recovering low -rank and sparse matrices from 
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compressive measurements was put forth in [37]. However, convergence of the algorithm and its error 
performance are only assessed via numerical simulations. A recursive algorithm capable of processing 
data in real time can be found in [15], which attains good performance in practice but does not offer 
theoretical guarantees. 

A deterministic approach along the lines of [14] is adopted first to derive conditions under which (1) is 
locally identifiable (Section II). Introducing a notion of incoherence between the additive components Xo 
and RAo, and resorting to the restricted isometry constants of R [12], sufficient conditions are obtained 
to ensure that (PI) succeeds in exactly recovering the unknowns (Section III-A). Intuitively, the results 
here assert that if r and s are sufficiently small, the nonzero entries of Ao are sufficiently spread out, 
and subsets of columns of R behave as isometries, then (PI) exactly recovers {Xo, Ao}. As a byproduct, 
recovery results for PCP and CS are also obtained by specializing the aforesaid conditions accordingly 
(Section III-B). The proof of the main result builds on Lagrangian duality theory [3], [8], to first derive 
conditions under which {Xo, Ao} is the unique optimal solution of (PI) (Section IV-A). In a nutshell, 
satisfaction of the optimality conditions is tantamount to the existence of a valid dual certificate. Stemming 
from the unique challenges introduced by R, the dual certificate construction procedure of Section IV-B is 
markedly distinct from the direct sum approach in [14], and the (random) golfing scheme of [10]. Section 
V shows that low-rank, sparse, and compression matrices drawn from certain random ensembles satisfy 
the sufficient conditions for exact recovery with high probability. 

Two iterative algorithms for solving (PI) are developed in Section VI, which are based on the accelerated 
proximal grandient (APG) method [2], [24], [29], [30], and the alternating-direction method of multipliers 
(AD-MoM) [4], [8]. Numerical tests corroborate the exact recovery claims, and the effectiveness of (PI) in 
unveiling traffic volume anomalies from real network data (Section VII). Section VIII concludes the paper 
with a summary and a discussion of limitations, possible extensions, and interesting future directions. 
Technical details are deferred to the Appendix. 

A. Notational conventions 

Bold uppercase (lowercase) letters will denote matrices (column vectors), and calligraphic letters will 
denote sets. Operators (•)', (•)', tr(-), vec(-), diag(-), A max (-)> 0"min( - )> and <g> will denote transposition, 
matrix pseudo inverse, matrix trace, matrix vectorization, diagonal matrix, spectral radius, minimum 
singular value, and Kronecker product, respectively; | • | will be used for the cardinality of a set and 
the magnitude of a scalar. The n x n identity matrix will be represented by I n and its i-th column by 
e^; while n denotes the n x 1 vector of all zeros, and nxp := n 0p. The ^-norm of vector x G W is 
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ll x llq := (Sf=i l^il 9 ) 1 ^ f° r 9^1- F° r matrices A, B G W nxn define the trace inner product (A, B) := 
tr(A'B). Also, recall that \\A\\f ■= \J tr ( AA') is the Frobenious norm, ||A||i := Yli j \ a ij\ * s tne 
norm, HAH^ := maxjj |ay| is the ^-norm, and ||A||* := ^o"j(A) is the nuclear norm. In addition, 
||A||i i := max|| x |j l=1 ||Ax||i = maxj ||e^A||i denotes the induced ^i-norm, and likewise for the induced 
£oo-norm HAH^oo := maxi| x ii =1 || Ax||oo = maxj ||Aej||i. For the linear operator A, define the operator 
norm ||„4|| := max||x|| F =i ||-4(X)||.f, which subsumes the spectral norm ||A|| := max|| x || =1 ||Ax||. Define 
also the support set supp(A) := : 7^ 0}. The indicator function l{ a= ;,} equals one when a = b, 

and zero otherwise. 

II. Local Identifiability 

The first issue to address is model identifiability, meaning that there are unique low-rank and sparse 
matrices satisfying (1). If there exist multiple decompositions of Y into X + RA with low-rank X and 
sparse A, there is no hope of recovering {Xo, Ao} from the data. For instance, if the null space of the 
fat matrix R contains sparse matrices, there may exist a sparse perturbation H such that A + H is still 
sparse and {Xo, Ao + H} is a legitimate solution. Another problematic case arises when there is a sparse 
perturbation H such that RH is spanned by the row or column spaces of Xo- Then, Xo + RH has the 
same rank as Xo and Ao — H may still be sparse. As a result, one may pick {Xo + RH, Ao — H} as 
another valid solution. Dealing with such identifiability issues is the subject of this section. 

Let USV denote the singular value decomposition (SVD) of Xo, and consider the subspaces: si) 
$(Xo) := {Z G R LxT : Z = UW^ + W 2 V, Wi G M Txr , W 2 G R Lxr } of matrices in either the 
column or row space of X ; s2) n(A ) := {H G R FxT : supp(H) C supp(A )} of matrices in R FxT 
with support contained in the support of A ; and s3) Qr(A ) := {Z G R LxT : Z = RH, H G O(A )}. 
For notational brevity, sl)-s3) will be henceforth denoted as {$,0,0^}. Noteworthy properties of these 
subspaces are: i) both and C M LxT , hence it is possible to directly compare elements from them; 
ii) Xo G $ and RA G fin; and iii) if Z G ( &- L is added to X , then rank(Z + X ) > r. 

For now, assume that the subspaces Q, r and are also known. This extra information helps identifiability 
of (1), because potentially troublesome solutions {Xo + RH, Ao — H} are limited to a restricted class. 
If Xo + RH ^ <I> or Ao — H ^ Q,, that candidate solution is not admissible since it is known a priori 
that Ao G and Xo G Under these assumptions, the following lemma puts forth the necessary and 
sufficient conditions guaranteeing unique decomposability of Y according to (1) - a notion known as 
local identifiability [10]. 

Lemma 1: Matrix Y uniquely decomposes into Xq + RAq if and only if $ n fijj = {Olxt}> ond 
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RH/0 Lx r,VHen\{O fxT }. 

Proof: Since by definition Xo E $ and Ao G £1, one can represent every element in the subspaces 
$ and Qr as Xo + Zi and RAo + Z2, respectively, where Zi G $ and Z2 G Or. Assume that $ n 
Ofl = {Olxt}, and suppose by contradiction that there exist nonzero perturbations {Zi,Z2} such that 
Y = Xo + Zi + RAo + Z2. Then, Zi + Z2 = OlxT, meaning that Zi and Z2 belong to the same 
subspace, which contradicts the assumption. Conversely, suppose there exists a non-zero Z G Or fl <£. 
Clearly, {X + Z,RA - Z} is a feasible solution where X + Z G <3? and RA - Z E f2 R . This 
contradicts the uniqueness assumption. In addition, the condition RH /0,He ri\{Oi X y} ensures that 
Z = OlxT e^nfifi only when Z = RH = OlxT for H = OfxT- ■ 
In words, (1) is locally identifiable if and only if the subspaces and Qr intersect transversally, and 
the sparse matrices in are not annihilated by R. This last condition is unique to the setting here, and 
is not present in [10] or [14]. 

Remark 1 (Projection operators): Operator Vq(X.) (Pq-l(X)) denotes the orthogonal projection of X 
onto the subspace SI (orthogonal complement fi- 1 ). It simply sets those elements of X not in supp(Ao) to 
zero. Likewise, ^(X) (V$,±(X.)) denotes the orthogonal projection of X onto the subspace $ (orthogonal 
complement <1>- L ). Let Pu := UU' and Py := W' denote, respectively, projection onto the column and 
row spaces of Xo- It can be shown that "P$(X) = P[/X + XPy — P[/XPy, while the projection onto 
the complement subspace is V$±(X.) = (I — P;y)X(I — Py). In addition, the following identities 

<7>*(X),7>*(Y)> = <P*(X),Y) = (X,P*(Y)> (2) 

of orthogonal projection operators such as V$(-), will be invoked throughout the paper. 

A. Incoherence measures 

Building on Lemma 1, alternative sufficient conditions are derived here to ensure local identifiability. 
To quantify the overlap between <I> and consider the incoherence parameter 

M(n*,*)= max J^flk. (3) 

for which it holds that /i(£Ir, <J?) G [0, 1]. The lower bound is achieved when $ and Qr are orthogonal, 
while the upper bound is attained when <3?nf2_R contains a nonzero element. Assuming <I>nr2_R = {Olxt}, 
then [i(Qr, <£) < 1 represents the cosine of the angle between <1> and Qr [17]. From Lemma 1, it appears 
that fJ,(&R, 3>) < 1 guarantees <3? n Qr = {Olxt}- As it will become clear later on, tighter conditions on 
/x(J1r, 3>) will prove instrumental to guarantee exact recovery of {Xq, Aq} by solving (PI). 
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To measure the incoherence among subsets of columns of R, which is tightly related to the second 
condition in Lemma 1, the restricted isometry constants (RICs) come handy [12]. The constant c)fc(R) 
measures the extent to which a fc-subset of columns of R behaves like an isometry. It is defined as the 
smallest value satisfying 

HRull 2 

c(l-<5 fc (R)) < ViiJ- <c(l + 4(R)) (4) 
ll u ll 

for every u G M F with ||u||o < k and for some positive normalization constant c < 1 [12]. For later use, 
introduce # Sl S2 (R) which measures 'how orthogonal' are the subspaces generated by two disjoint column 
subsets of R, with cardinality si and s 2 . Formally, 9 SuS2 (R) is the smallest value that satisfies 

|(Rui,Ru 2 )| < c0 Sl!S2 (R)||ui||||u 2 || (5) 

for every ui,U2 G M. F , where supp(ui) n supp(u 2 ) = and ||ui||o < s±, 1 1 u.2 1 1 o < S2- The normalization 
constant c plays the same role as in (^.(R). A wide family of matrices with small RICs have been 
introduced in e.g., [12]. 

All the elements are now in place to state this section's main result. 

Proposition 1: Assume that each column of Ao contains at most k nonzero elements. If h(£Ir, <!>) < 1 
and S k (R) < 1, then tt R n $ = {0 L xt} and RH / Ol x t,H G f2\{0TxT}- 

Proof: Suppose the intersection in nontrivial, meaning that there exists nonzero matrices HgH and 
UW'j + W2V G $ satisfying RH = UW'j + W 2 V. Vectorizing the last equation and relying on the 
identity vec(AXB) = (B' <g> A)vec(X), one obtains a linear system of equations 

[I T <8> R - It ® U - V ® I L ]w = LT (6) 

where w := [vec(H)' vec(W 1 ) vec(W 2 )]'. Define an LTxFT matrix Ci := I T ®R and the LTx (L+T)r 
matrix C 2 := [—It ® U — V <2> 1^]. The corresponding coefficients are wi := vec(H) and w 2 := 
[vecCWi) vec(W 2 )]'. Then, (6) implies there exists awi/ Ott such that C1W1 + C 2 w 2 = Olt- 

(i) 

Consider two cases: i) w 2 = O t (l+t)> an ^ ii) w 2 T O r (L+T)- Under i) C1W1 = Olt, and thus RwJ = 
for some nonzero with i G {1,2, ...,T} where wi = [wj'...w^]. Therefore, if ||w^||o < k, 
5fe(R) < 1 implies that Wj = Olt, which is a contradiction. For ii) fJ,(flR, < 1 implies that there 
is no wi with supp(wi) C supp(vec(Ao)) and w 2 G R( i+T ) r such that C1W1 + C 2 w 2 = Ott> since 
otherwise |(CiWi, C 2 w 2 )| = ||CiWi|| ||C 2 w 2 || which leads to h(Qr, $) = 1. ■ 

III. Exact Recovery via Convex Optimization 

In addition to fJ,(QR, there are other incoherence measures which play an important role in the 
conditions for exact recovery. Consider a feasible solution {Xq + ayRe^, Aq — a^ete^}, where ^ 
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supp(Ao) and thus aijeie 1 - Q,. It may then happen that a^Reje^ G $ and rank(Xo + ajj-Reje^) = 
rank(Xo) — 1, while ||Ao — ajjeje^||o = ||Ao||o + 1, challenging identifiability when and Q,r are 
unknown. Similar complications will arise if Xo has a sparse row space that could be confused with the 
row space of Ao- These issues motivate defining 

||P[/Reie'-||jr 

7it(U) :=max— — jf — , 7(V) := max ||Pyej||ir 

i,j ||Keje-||i? i 

where 7^(U),7(V) < 1. The maximum of 7_r(U) [7(V)] is attained when Re^ [e$] is in the column 

[row] space of Xo for some (i, j). Small values of 7/j(U) and 7(V) imply that the column and row 

spaces of Xo do not contain the columns of R and sparse vectors, respectively. 

Another identifiability issue arises when Xo = RH for some sparse matrix H G Q. In this case, each 

column of Xo is spanned by a few columns of R. Consider the parameter 

£ R (U,V) := HR'UV'lloo =max|e/R'UVeJ. 

i,j 

A small value of £r(U, V) implies that each column of Xo is spanned by sufficiently many columns of 
R. To understand this property, suppose for simplicity that all nonzero singular values of Xo are identical 
and equal to a, say. The k-th column of Xo is then YH=i °~ u i v i,k, and its projection onto the l-th column 
of R is 

r r 

i=i i=i 

Since the energy of Yll=i au i v i,k is somehow allocated along the directions Re^, if all the aforementioned 
projections can be made arbitrarily small, then sufficiently many nonzero terms in the expansion are needed 
to account for all this energy. 

A. Main result 

Theorem 1: Consider given matrices Y e M ixT and R G R LxF obeying Y = X + RA = USV + 
RAo, with r := rank(X.Q) and s := ||Ao||o- Assume that every row and column of Aq has at most k 
nonzero elements, and that R has orthonormal rows. If the following conditions 

I) (1 - n R )) 2 (l - 8 k (R)) > uj max ; and 

II) (l + a mfl j(i^)^(U,V)^ + ^(^,^)(l + 4(R)) 1/2 (l + a m «,)^< 1 
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hold, where 



<*W := 9 1A (R)[V2k + s 7 2 (V)] + (1 + ^(R)) [a/2&7&(U) + fc 7 2 (V) + s 7 2 (U) 7 2 (V) 

1 1 1/2 




(1 - n(Q R , 3>)) 2 (1 - 5 fc (R))w„7i - 1 



f/jen f/iere exists A > for which the convex program (PI) exactly recovers {Xo, Ao}. 

Note that I) alone is already more stringent than the pair of conditions fJ,(0,R, 3>) < 1 and <5fc(R) < 1 
needed for local identifiability (cf. Proposition 1). Satisfaction of the conditions in Theorem 1 hinges upon 
the values of the incoherence parameters //(Or, $),7r(U),7(V), £r(U, V), and the RICs <5fc(R) and 
#i l(R). In particular, {w max , a max , /3 max } are increasing functions of these parameters, and it is readily 
observed from I) and II) that the smaller {w max , a max , /3 max } are, the more likely the conditions are met. 
Furthermore, the incoherence parameters are increasing functions of the rank r and sparsity level s. The 
RIC £fc(R) is also an increasing function of k, the maximum number of nonzero elements per row/column 
of Ao- Therefore, for sufficiently small values of {r, s, k}, the sufficient conditions of Theorem 1 can be 
indeed satisfied. 

It is worth noting that not only s, but also the position of the nonzero entries in Ao plays an important 
role in satisfying I) and II). This is manifested through k, for which a small value indicates the entries 
of Ao are sufficiently spread out, i.e., most entries do not cluster along a few rows or columns of Ao- 
Moreover, no restriction is placed on the magnitude of these entries, since as seen later on it is only the 
positions that affect optimal recovery via (PI). 

Remark 2 (Row orthonormality of R): Assuming RR' = I/, is equivalent to supposing that R is full- 
rank. This is because for a full row-rank R = UrXIrVr', one can pre-multiply both sides of (1) with 
Ur' to obtain R := Vr' with orthonormal rows. 

B. Induced recovery results for principal components pursuit and compressed sensing 

Before delving into the proof of the main result, it is instructive to examine how the sufficient conditions 
in Theorem 1 simplify for the subsumed PCP and CS problems. In PCP one has R = II, which implies 
Qr = Q and <5fc(R) = 01,1 (R) = 0. To obtain sufficient conditions expressed only in terms of //(<!>, O), 
one can borrow the coherence conditions of [10] and readily arrive at the following result. 

Corollary 1: Consider given Y £ R LxT obeying Y = X + A = USV + A , with r := rank(X ) 
and s := ||Ao||o- Suppose the coherence conditions 7(U) := maxj ||P;yei || < yJprjL, 7(V) < y/pr/T, 
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and £(U, V) := HUV'Hoo < yj pr/LT hold for some positive constant p. If //(<&, fi) is sufficiently small 
such that the following conditions 

X) < Q) < 1 — yjuj max ; and 

XT) (l + tW^KSfc) ^ + ^($,0)} < 1 



/ioZg?, where 



t 1 1 

<*W := prfe ( — + — 

1 



A, 



(i-m($,Q)) 2 
i 



i 



1/2 



(i -/*(<&, n^uw- 1 )-! 

f/ierc ?/iere exists A > 0/or which the convex program (PI) with R = 1^ exactly recovers {Xo, Ao}. 
In Section V, random matrices {Xo, Ao,R} drawn from natural ensembles are shown to satisfy I) and 
II) with high probability. In this case, it is possible to arrive at simpler conditions (depending only on r, 
s, and the matrix dimensions) for exact recovery in the context of PCP; see Remark 6. Corollary 1, on 
the other hand, offers general conditions stemming from a purely deterministic approach. 

In the CS setting one has X = OlxT, which implies = £r(U,V) = 7r(U) = 7(V) = 0. 

As a result, Theorem 1 simply boils down to a RIC-dependent sufficient condition for the exact recovery 
of Ao as stated next. 

Corollary 2: Consider given matrices Y 6 R LxT and R G M Lxir obeying Y = RA . Assume that the 
number of nonzero elements per column of Ao does not exceed k. If 

$ fc (R) + fc0i,i(R)<l (7) 

holds, then (PI) with X = OlxT exactly recovers Aq. 

To place (7) in context, consider normalizing the rows of R. For such a compression matrix it is 
known that 5fe(R) < (k — l)#i i(R), see e.g., [31]. Using this bound together with (7), one arrives at 
the stricter condition k < ~ ^1 + ^^}(R)^. This last condition is identical to the one reported in [19], 
which guarantees the success of i^-norm minimization in recovering sparse solutions to under-determined 
systems of linear equations. The conditions have been improved in recent works; see e.g., [31] and 
references therein. 



IV. Proof of the Main Result 

In what follows, conditions are first derived under which {Xo, Ao} is the unique optimal solution of 
(PI). In essence, these conditions are expressed in terms of certain dual certificates. Then, Section IV-B 
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deals with the construction of a valid dual certificate. 

A. Unique optimality conditions 

Recall the nonsmooth optimization problem (PI), and its Lagrangian 

£(X,A,M) = ||X||* + A||A||i + (M,Y-X-RA) (8) 

where M S M ixT is the matrix of dual variables (multipliers) associated with the constraint in (PI). From 
the characterization of the subdifferential for nuclear- and £i-norm (see e.g., [8]), the subdifferential of 
the Lagrangian at {Xo, Ao} is given by (recall that Xo = USV) 

dx£(Xo,A ,M) = {UV'+W-M : ||W|| < 1, 7>$(W) = LxT } (9) 
5 A /:(Xo,Ao,M) = {A S ign(Ao) + AF-R / M: ||F|U < 1, V Q (F) = FxT } . (10) 

The optimality conditions for (PI) assert that {Xo, Ao} is an optimal (not necessarily unique) solution if 
and only if 

FxT G d A £(X , A ,M) and LxT G d x £(X , A , M). 

This can be shown equivalent to finding the pair {W,F} that satisfies: i) ||W|| < 1, 7^$(W) — O^x^; 
ii) ||F||oo < 1, Vn{F) = FxT ; and iii) Asign(A ) + AF = R'(UV' + W). In general, i)-iii) may hold 
for multiple solution pairs. However, the next lemma asserts that a slight tightening of the optimality 
conditions i)-iii) leads to a unique optimal solution for (PI). See Appendix A for a proof. 

Lemma 2: Assume that each column of Aq contains at most k nonzero elements, as well as /i(fi.R, <3?) < 1 
and 5k (EL) < 1. If there exists a dual certificate T € M LxT satisfying 

CI) V*(T) = UV 

C2) Vn(EL'T) = Asgn(A ) 

C3) ||^(r)||<i 

C4) ||P n x(RT)|| 0O <A 
then {Xo, Ao} is the unique optimal solution of (PI). 

The remainder of the proof deals with the construction of a dual certificate T that meets Cl)-C4). To 
this end, tighter conditions [I) and II) in Theorem 1] for the existence of T are derived in terms of the 
incoherence parameters and the RICs. For the special case EL = If,, the conditions in Lemma 2 boil down 
to those in [14, Prop. 2] for PCP However, the dual certificate construction techniques used in [14] do 
not carry over to the setting considered here, where a compression matrix R is present. 
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B. Dual certificate construction 

Condition CI) in Lemma 2 implies that V = UV' + (I - P[/)X(I - P v ), for arbitrary X G R LxT (cf. 
Remark 1). Upon denning Z := R(I - Pp)X(I - P p ) and B n := Asign(A ) - P n (R'UV'), CI) and 
C2) are equivalent to Vn(Z) = Bq. 

To express Vn(Z) = in terms of the unrestricted matrix X, first vectorize Z to obtain vec(Z) = 
[(I - Py) <g> R'(I - P{/)] vec(X). Define A := (I-Py)<g>R'(I-P;y) and an s x LT matrix A n formed 
with those s rows of A associated with those elements in supp(Ao). Likewise, define Ajji which collects 
the remaining rows from A such that A = II [A^, A^ ± ]' for a suitable row permutation matrix II. Finally, 
let hn be the vector of length s containing those elements of Bq with indices in supp(Ao). With these 
definitions, CI) and C2) can be expressed as Aqvcc(X) = b^. 

To upper-bound the left-hand side of C3) in terms of X, use the assumption RR' = II to arrive at 

||7V(r)|| = ||R'(I - Pc/)X(I - P v )|| < ||R'(I - Pc/)X(I - P V )\\ F = ||Avec(X)||. 
Similarly, the left-hand side of C4) can be bounded as 

||7V(R'r)|U = ||^(z) + ^(R / uv / )||oo 

< ||^(Z)|| 00 + ||^(RUV / )|| 00 

= ||A ni vec(X)|| 00 + HPn^R'UVOlloc. 

In a nutshell, if one can find X G R LxT such that 
cl) A c vec(X) = b n 
c2) ||Avec(X)|| < 1 

c3) ||A^vec(X)|| 00 + HPn^R'UVOlU < A 
hold for some positive A, then Cl)-C4) would be satisfied as well. 

The final steps of the proof entail: i) finding an appropriate candidate solution X such that cl) holds; 
and ii) deriving conditions in terms of the incoherence parameters and RICs that guarantee X meets 
the required bounds in c2) and c3) for a range of A values. The following lemma is instrumental to 
accomplishing i), and its proof can be found in Appendix B. 

Lemma 3: Assume that each column of Aq contains at most k nonzero elements, as well as fJ,(flR, <3?) < 1 
and 5fc(R) < 1. Then matrix A^ has full row rank, and its minimum singular value is bounded below as 

(r min (A' a ) > c^il - 4(R)) 1/2 (1 - 
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According to Lemma 3, the least-norm (LN) solution Xln := argminx {||X|||, : Anvec(X) = bn} 
exists, and is given by 

vec(X LN ) = A' n {AnA'^y 1 b n . (11) 

Remark 3 (Candidate dual certificate): From the arguments at the beginning of this section, the can- 
didate dual certificate is f := UV' + (I - P;y)X LN (I - Py). 

The LN solution is an attractive choice, since it facilitates satisfying c2) and c3) which require norms 
of vec(X) to be small. Substituting the LN solution (11) into the left hand side of c2) yields (define 
Q := Aq± A' n (AqA^)^ 1 for notational brevity) 

A Q 



|Avec(X L N) 



A' n (An A' n ) 1 bn 



< (i + IIQI 



(12) 



Moreover, substituting (11) in the left hand side of c3) results in 

HQbnlloc + ^^(R'UVOHoc < ||Q||oo,oo[|bn]|oo + IPV(H.'UV')IU- (13) 
Next, upper-bounds are obtained for ||Q|| and ||Q||oo,oo; see Appendix C for a proof. 
Lemma 4: Assume that each column and row of Aq contains at most k nonzero elements. If h(£Ir, <3?) < 1 
and <5fc(R) < 1 hold, then 

m - " max := I^FWFMW " 1 

If the tighter condition I) holds instead, then 

^max 



1/2 



IQI 



</?n 



" (1 - /i(f^,$)) 2 (l - 6 k (R)) - Umax' 

Going back to (12)-(13), note that ||Bn||oo = ||bn||oo an( i I|Bq||f = ||bn||, which can be respectively 
upper-bounded as 



|Bn||oo = ||Asign(A ) - Pn(R'UV')|U < A + ||P n (B/UV')||oc 
||B n ||F = ||Asign(A ) - P n (R'UV')|k < A^ + ||7^(R'UV')|| F . 



(14) 
(15) 
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Finally, ||7 :> n(R / UV / )||.F itself can be bounded above as 

||P n (R'UV')||! = |(P n (RW),7' n (R'UV))| ( = } |(R / UV / ,^(R'UV'))| 

= |(UV / ,RP n (R / UV / ))| = |(P <E ,(UV / ),^(R^(R / UV / )))| 

< ||P <I) (UV / )||f||^(R^(R / UV / ))||f 

(d) 

< ||UV , || F/ u(4>,fi r )||R^(RUV , )||F 

< v^M(*,^r)c 1/2 (l + ^(R)) 1/2 ||^(R / UV / )||ir (16) 

where (a) is due to (2), (b) follows because UV' 6 $ (thus P$(UV') = UV') and from the property in 
(2). Moreover, (c) is a direct result of the Cauchy-Schwarz inequality, while (d) and (e) come from (3) 
and (4), respectively, and the assumption that number of nonzero elements per column of Ao does not 
exceed k. All in all, \\VnCR'UV')\\ F < y/fn($, Q R )^ 2 {1 + 5 fc (R)) 1/2 and (15) becomes 

\\B n \\ F < A^+vM$,^r)c 1/2 (l + <S fc (R)) 1/2 . (17) 

Upon substituting (14), (17) and the bounds in Lemma 4 into (12) and (13), one finds that c2) and c3) 
hold if there exists A > such that 



(1 + "max) [AV^ + y/fn(n R , $)C 1 / 2 (1 + ^.(R)) 1 



/2 



< 1 (18a) 



/3max (A + ||Pn(R'UV , )||oo) + IITVtR'UVOHoo < A (18b) 

hold. Recognizing that f fl (U,V) = max{||Pn(R / UV')||oo, WVq^CR'UV')^}, the left-hand side of 
(18b) can be further bounded. After straightforward manipulations, one deduces that conditions (18a) and 
(18b) are satisfied for A G (A m j n ,A max ), where 

\ i Pmax / 

A m ax := 4= + a ™* )^ ~ vfy(«ii, $>)c l ' 2 (l + 5 fc (R)) 1/2 
L 

Clearly, it is still necessary to ensure A max > A mm so that the LN solution (11) meets the requirements cl)- 
c3) [equivalently, F in Remark 3 satisfies Cl)-C4) from Lemma 2]. Condition A max > A min is equivalent 
to II) in Theorem 1, and the proof is now complete. 

Remark 4 (Satisfiability): From a high-level vantage point, Theorem 1 asserts that (PI) recovers {Xo, Ao} 
when the components Xo and RAo are sufficiently incoherent, and the compression matrix R has good 
restricted isometry properties. It should be noted though, that given a triplet {Xo, Ao,R} in general one 
cannot directly check whether the sufficient conditions I) and II) hold, since e.g., <5fc(R) is NP-hard to 
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compute [12]. This motivates finding a class of (possibly random) matrices {Xo, Ao, R} satisfying I) and 
II), the subject dealt with next. 

V. Matrices Satisfying the Conditions for Exact Recovery 

This section investigates triplets {X , A , R} satisfying the conditions of Theorem 1, henceforth termed 
admissible matrices. Specifically, it will be shown that low-rank, sparse, and compression matrices drawn 
from certain random ensembles satisfy the sufficient conditions of Theorem 1 with high probability. 

A. Uniform sparsity model 

Matrix Ao is said to be generated according to the uniform sparsity model, when drawn uniformly at 
random from the collection of all matrices with support size s. There is no restriction on the amplitude 
of the nonzero entries. An attractive property of this model is that it guarantees (with high probability) 
that no single row or column will monopolize most nonzero entries of Ao, for sufficiently large Ao and 
appropriate scaling of the sparsity level. This property is formalized in the following lemma (for simplicity 
in exposition it is henceforth assumed that that Ao is a square matrix, i.e., F = T). 

Lemma 5: [14] If Aq £ M FxF is generated according to the uniform sparsity model with ||Ao||o = s, 
then the maximum number k of nonzero elements per column or row of Ao is bounded as 

k<j\og(F) 

with probability higher than 1 — 0{F~^), for s = 0{C,F). 

In practice, it is simpler to work with the Bernoulli model that specifies supp(Ao) = '■ hj = 1}, 

where {bij} are independent and identically distributed (i.i.d.) Bernoulli random variables taking value 
one with probability ir := s/F 2 , and zero with probability 1 — it. There are three important observations 
regarding the Bernoulli model. First, |supp(Ao)| is a random variable, whose expected value is s and 
matches the uniform sparsity model. Second, arguing as in [10, Lemma 2.2] one can claim that if (PI) 
exactly recovers {Xo, Ao} from data Y = Xo + RAo, it will also exactly recover {Xo, Ao} from 
Y = Xo + RAo when supp(Ao) C supp(Ao) and the nonzero entries coincide. Third, following the logic 
of [11, Section II.C] one can prove that the failure rate 1 for the uniform sparsity model is bounded by 
twice the failure rate corresponding to the Bernoulli model. As a result, any recovery guarantee established 
for the Bernoulli model holds for the uniform sparsity model as well. 

'The failure rate is defined as Pr(A ^ Ao), where A is the solution of (PI). 
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In addition to the bound for k in Lemma 5, the Bernoulli model can be used to bound //(<&, fi#) in 
terms of the incoherence parameters {7_r(U), 7(V)} and the RIC <5fc(R). For a proof, see Appendix D. 

Lemma 6: Let A := ^/c(l + <*i(R)) [t|(U) + 7 2 (V)] 1/2 and n := max{L, F}. Suppose A G M FxF 
w generated according to the Bernoulli model with Pr(6j = 1) = ir, and RR' = II- Then, there exist 
positive constants C and r such that 



A*($, Ob) < ^c-^l-^^R))- 1 ^ \cAy/\og(LF)/ir + rA log(n) + 1 



1/2 



(19) 



/lo/cfs w/f/i probability at least 1 — n CnAr if <5fc(R) cmcf f/ie right-hand side of (19) cfo rcof exceed one. 1 
Consider (19) when A is small enough so that the quantity inside the square brackets is close to one. 



One obtains /j,($,Qr) < \Jc~ l {l — ^(R))" 1 ^, which reduces to the bound ju($, Q) < \/tt derived in 
[10, Section 2.5] for the special case R = 1^. Hence, the price paid in terms of coherence increase due 



to R is roughly \J c~ 1 (l — ^(R)) _1 > 1. As expected, (19) also shows that for R with small RICs the 
incoherence between subspaces $ and Qr becomes smaller, and identifiability is more likely. 

The result in Lemma 6 allows one to 'eliminate' / u( < l > , Qr) from the sufficient conditions in Theorem 1, 
which can thus be expressed only in terms of {7#(U),7(V),£r(U, V)} and the RICs of R. In the 
following sections, random low-rank and compression matrices giving rise to small incoherence parameters 
and RICs are described. 



B. Random orthogonal model 

Among other implications, matrices Xo and R with small Jr(U) and £r(U, V) are such that the 
columns of R (approximately) fall outside the column space of Xo- From a design perspective, this 
suggests that the choice of an admissible Xo (or in general an ensemble of low-rank matrices) should 
take into account the structure of R, and vice versa. However, in the interest of simplicity one could 
seek conditions dealing with Xo and R separately, that still ensure 7ij(U) and £r(U, V) are small. This 
way one can benefit from the existing theory on incoherent low-rank matrices developed in the context of 
matrix completion [9], and matrices with small RICs useful for CS [11], [31]. Admittedly, the price paid 
is in terms of stricter conditions that will reduce the set of admissible matrices. 

In this direction, the next lemma bounds 7r(U) and £r(U, V) in terms of 7(U) := maxj ||Pf/ej||, 
7 (V) and 5 fc (R). 

2 Even though one has n = F and iv = s / F 2 in the problem studied here, Lemma 6 is stated using n and ty to retain generality. 



IEEE TRANSACTIONS ON INFORMATION THEORY (SUBMITTED) 



15 



Lemma 7: If r](R) := maxj ||Re.j||i/||Rej||, it then holds that 

7fl (U) < ??(R)7(U) (20) 
&(U,V) < v / ^HM#)7(U)7(V). (21) 



Proof: Starting from the definition 



IIPc/ReiH ||Pf/^ e ^ Re i 
7h(UJ = max — — — - — = max ■ 



I R e « II * II R e t I 



W E^llP^llle^Reil ft) llR^lli 

< max — J [t- 5 < 7(Ujmax-r — (22) 

i l|P e i|| ' ll Re i|| 

where (a) follows from the Cauchy-Schwarz inequality, and (b) from the definition of 7(U). 
Likewise, applying the definition of £r(U, V) one obtains 

(c) 

£r(U,V) = maxle-R'UV'ejl < max ||U'Re-|| max ||V'ej|| 

i,j i i 

(d) 



< Vc(l + 5i(R))7fl(U)7(V) < Vc(l + <yi(R))»7(R)7(U)7(V) (23) 

where (c) follows from the Cauchy-Schwarz inequality, and (d) is due to (22). ■ 
The bounds (20) and (21) are proportional to 7(U) and 7(V). This prompts one to consider incoherent 
rank-r matrices Xo = US V generated from the random orthogonal model, which is specified as follows. 
The singular vectors forming the columns of U and V are drawn uniformly at random from the collection 
of rank-r partial isometries in M Lxr and M Fxr , respectively. There is no need for U and V to be statistically 
independent, and no restriction in placed on the singular values in the diagonal of S. The adequacy of the 
random orthogonal model in generating incoherent low-rank matrices is justified by the following lemma 
(recall T = F > L). 

Lemma 8: [14] If~Ko = USV' e M. LxF is generated according to the random orthogonal model with 
rank(X.o) = r, then 



! m\ nrw s , max{r, log(F)} 
max{7(U),7(V)} < W 

with probability exceeding 1 — 0{F~ ?J log(F)). 



C. Random compressive matrices 

With reference to Lemma 7 [cf. (20) and (21)], it is clear that an incoherent Xo alone may not suffice 
to yield small 7r(U) and £r(U, V). In addition, rj(R) € [1,\/L] should be as close as possible to one. 
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This can be achieved e.g., when R is sparse across each column. Note that the lower bound of unity is 
attained when R has at most a single nonzero element per column, as it is the case when R = 1^. 

The aforementioned observations motivate considering block-diagonal compression matrices R G M. LxF , 
consisting of blocks {R; G ffi^ x ^} where I < f. The number of blocks is n& := F/f assuming that / 
divides F. The i-th block is generated according to the bounded orthonormal model as follows; see 
e.g., [31]. For some positive constant K, (deterministically) choose a unitary matrix \l/ G Rf x f with 
bounded entries 

max |* t J < K (24) 

where F := {1,...,/}. For each i = 1, ...,n& form Rj := @ T ( i )\E r , where r <*> := [e.« , . . . , e.co]' G 
K J is a random row subsampling matrix that selects the rows of * indexed by T W := {tf ] , ...,tf ] } C F. 
In words, 0y(o is formed by those I rows of If indexed by The row indices in are selected 
independently at random, with uniform probability 1// from F. By construction, RjR'^ = 1^, i = 1, . . . , rib, 
which ensures RR' = I/, as required by Theorem 1. Most importantly, the next lemma states that such 
a construction for Rj leads to small RICs with high probability; see e.g., [31] for the proof. 

Lemma 9: [31] Let Rj G M. ix f be generated according to the bounded orthonormal model. If for some 
hi G [1, /], e G (0, 1) and fi G (0, 1/2] the following condition 

1 * > ^K 2 ^- 2 S log 2 (100fe J )log(4/)log(7 e - 1 ) (25) 
log(ltM) 

holds where the constant D < 243, 150, then (R^) < /i with probability greater than 1 — e. 

Lemma 9 asserts that for large enough £, the RIC ^(Ri) = 0(log(lOOfci) log(lOf) log(4/) 1 / 2 yW^) 

with overwhelming probability. 

Let ki denote the maximum number of nonzero elements per 'trimmed' column of Ao, the trimming 
being defined by the block of rows of Ao that are multiplied by Rj when carrying out the product RAo- 
With these definitions, the RIC of R is bounded as Jjfc(R) < maxjj^ (Rj)}. For Jjfc(R) to be small 
as required by Theorem 1, the k{ should be much smaller than I. Since Ao is generated according to 
the uniform sparsity model outlined in Section V-A, its nonzero elements are uniformly spread across 
rows and columns as per Lemma 5. Formally, it holds that ki < k := (s/Frib) log(Fnb) with probability 
1 — O^FubY^), where s = \\ Ao ||o = C^ n b', see e.g., [6]. Accordingly, from Lemma 9 one can infer that 
5 k (R) = O(log(100K)log(10^)log(4/) 1 /2y^) W i t h high probability. Note that the bound for 5 k (R) 
depends on k through the variable s in k, and the relationship between s and k in Lemma 5. Regarding 
the RIC 01,1 (R), it is bounded as 9i t x (R) < ^(R) [12]. The normalization constant c in (4) and (5) 
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also equals L/F < 1. Recalling ??(R) (cf. Lemma 7) which was subject of the initial discussion in this 
section, it turns out that for such a construction of R one obtains rj(R) < VI <C \[L. 
Remark 5 (Row and column permutations): The class of admissible compression matrices can be ex- 
tended to matrices which are block diagonal up to row and column permutations. Let n r (II C ) denote, 
respectively, the row (column) permutation matrices that render R block diagonal. Instead of (1) consider 
II r Y = n r Xo + n r Rn c n' c Ao and note that n r Xo has the same coherence parameters as Xo, while 
n r Rn c has the same RICs as R, and H' c Aq is still uniformly sparse. Thus, one can feed the transformed 
data to (PI) and since n r and n c are invertible, {Xo, Ao} can be readily obtained from the recovered 

{n r x ,n' c A }. 

D. Closing the loop 

According to Lemmata 6 and 7, the incoherence parameters ^i(^,Qr), 7k(U) and £r(U,V) which 
play a critcal role toward exact decomposability in Theorem 1, can be upper-bounded in terms of 7(U) and 
7(V). For random matrices {Xo, Ao, R} drawn from specific ensembles, Lemmata 5, 8 and 9 assert that 
the incoherence parameters 7(U) and 7(V) as well as the RICs 5k(R) and #i i(R), are bounded above in 
terms of r = rank(Xo), the degree of sparsity s = ||Ao||o> and the underlying matrix dimensions L, F, £, f. 
Alternative sufficient conditions for exact recovery, expressible only in terms of the aforementioned basic 
parameters, can be obtained by combining the bounds of this section along with I) and II) in Theorem 1. 
Hence, in order to guarantee that (PI) recovers {Xo,Ao} with high probability and for given matrix 
dimensions, it suffices to check feasibility of a set of inequalities in r and s. 

To this end, focus on the asymptotic case where L and F are large enough, while F = T for simplicity 
in exposition. Recall the conditions of Theorem 1 and suppose <5&(R) = o(l) and SIr) = o(l). 
This results in a max « \/F/L and (3 max « (w~Lc ~~ I) -1 when L <g F. Satisfaction of I) and II) 
then requires 0(1) summands in the left-hand side of II), which gives rise to £r(U, V) = G(yf L/ Fs), 
Qr) = 0{yj L/ Fr), and w m ax = 0{\) < 1. The latter which is indeed the bottleneck constraint can 
be satisfied if U (R) = 0(l/k), ^i,i(R)7 2 (V) = 0(l/s), 7 |(U) = 0(l/k), 7 2 (V) = 0(l/k), and 
7r(U)7r(V) = 0(1/ s). Utilizing the bounds in Lemmata 6-9 establishes the next corollary. 

Corollary 3: Consider given matrices Y G R LxF and Re R LxF obeying Y = Xo + RAo, where r := 
rank(X.o) and s := || Ao||o- Suppose that: (i) Xo is generated according to the random orthogonal model; 
(ii) Ao is generated according to the uniform sparsity model; and (ii) R = bdiag (Ri, . . . , Rn b ) with 
blocks Rj G M/ X f generated according to the bounded orthogonal model. Define f := max{r, log(F)}. 
If r and s satisfy 
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i) r^i 

ii) c ~^ min f p2 p2 \ 



iii) S V2i og h 00 |£log(^)) -< 



F 2 £ 

^7 



_/log(F 2 //)log*(/)_ 

f/iere is a positive Xfor which (PI) recovers {Xo, Ao} with high probability. 

Remark 6 (Principal components pursuit): For PCP where R = 1^ and L = T (cf. Corollary 1), it 
can be readily verified that s min{r, log(L)} = 0(L 2 / log(L)) suffices for exact recovery of {Xo, Ao} 
by solving (PI). This guarantee is of course valid with high probability, provided {Xo, Ao, R} are drawn 
from the random matrix ensembles outlined throughout this section. However, in the presence of the 
compression matrix R more stringent conditions are imposed on the rank and sparsity level, as stated 
in Corollary 3. This is mainly because of the dominant summand [y2k + s7 2 (y)]#i 5 i(R) in ui max (cf. 
Theorem 1), which limits the extent to which r and s can be increased. If the correlation between any 
two columns of R is small, then higher rank and less sparse matrices can be exactly recovered. 

VI. Algorithms 

This section deals with iterative algorithms to solve the non-smooth convex optimization problem (PI). 



A. Accelerated proximal gradient (APG) algorithm 

The class of accelerated proximal gradient algorithms were originally studied in [29], [30], and they 
have been popularized for ^i-norm regularized regression; mostly due to the success of the fast iterative 
shrinkage-thresholding algorithm (FISTA) [2]. Recently, APG algorithms have been applied to matrix- 
valued problems such as those arising with nuclear-norm regularized estimators for matrix completion [36], 
and for (stable) PCP [24], [40]. APG algorithms offer several attractive features, most notably a conver- 
gence rate guarantee of 0(l/y/e) iterations to return an e— optimal solution. In addition, APG algorithms 
are first-order methods that scale nicely to high-dimensional problems arising with large networks. 

The algorithm developed here builds on the APG iterations in [24], proposed to solve the stable PCP 
problem. One can relax the equality constraint in (PI) and instead solve 

(P2) mm |i/||X||* + ^A||A||i + ^||Y-X-RA|||| 

with S := [X', A']', where the least-square term penalizes violations of the equality constraint, and v > 
is a penalty coefficient. When v approaches zero, (P2) achieves the optimal solution of (PI) [3]. The 
gradient of /(S) := |||Y — X — RA|||i is Lipschitz continuous with a (minimum) Lipschitz constant 
L f = A max ([I L R]'[I L R]), i.e., ||V/(Si) - V/(S 2 )|| < L/||Si - S 2 ||, VSi, S 2 in the domain of /. 
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Instead of directly optimizing the cost in (P2), APG algorithms minimize a sequence of overestimators, 
obtained at judiciously chosen points T. Define g(S) := ^||X|| + + fA||A||i and form the quadratic 
approximation 

Q(S, T) := /(T) + (V/(T), S - T) + ^-||S - T||| + g(8) 

= ^||S - G||| + g(S) + /(T) - J-||V/(T)||| (26) 

where G := T - (1/L/)V/(T). With k = 1,2,... denoting iterations, APG algorithms generate the 
sequence of iterates 

S[fc] := argminQ(S,T[A;]) = argmin ( ^||S - G[k] \\ 2 F + g(S)\ (27) 
s s I 2 

where the second equality follows from the fact that the last two summands in (26) do not depend on S. 
There are two key aspects to the success of APG algorithms. First, is the selection of the points T[k] where 
the sequence of approximations Q(S, T[k]) are formed, since these strongly determine the algorithm's con- 
vergence rate. The choice T[Jfe] = S[fc]+ * [fe ^|~ 1 (S[k] — S[fc — 1]), where t[k] = 1 + ^/4t 2 [k - 1] + 1 /2, 
has been shown to significantly accelerate the algorithm resulting in convergence rate no worse than 
0(l/k 2 ) [2]. The second key element stems from the possibility of efficiently solving the sequence of 
subproblems (27). For the particular case of (P2), note that (27) decomposes into 

X[fc + 1]:= argmin |^||X-Gx[A;] ||| + ^||X||* 1 (28) 

A[Ar + l] := argrmn j^||A-G A [fc]||! + z/A||A||ij (29) 

where G[k] = [G' x [k] G' A [k]]'. Letting <S T (M) with (z,j)-th entry given by sign(mjj) max{|raj j| — r, 0} 
denote the soft-thresholding operator, and USV' = svd(Gx[&]) the singular value decomposition of 
matrix Gx[&], it follows that (see, e.g. [24]) 

X[fc + 1] =U5_^[E]V, A[k + l]=S^[G A [k]}. (30) 

A continuation technique is employed to speed-up convergence of the APG algorithm. The penalty 
parameter v is initialized with a large value vq, and is decreased geometrically until it reaches the target 
value of v. The APG algorithm is tabulated as Algorithm 1. Similar to [24] and [36], the iterations 
terminate whenever the norm of 

L f (T x [k] - X[k + 1]) + (K[k + 1] + RA[k + 1] - T x [k] - KT A [ 

Z[k + 1J := 

L f (T A [k] - A[k + 1]) + R'(X[fc + 1] + RA[fc + 1] - T x [k] - BT A [k}) 
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Algorithm 1 : APG solver for (PI) 



input Y, R, A, v, i/o, i>L f = A max ([I L R]'[I L R]) 

initialize X[0] = X[-l] = LxT , A[0] = A[-l] = FxT , t[0] = t[-l] = 1, and set k = 0. 
while not converged do 

T x [k] = X[k] + fcip (X[fc] - X[fc - 1]). 
T A [k] = A[k] + (A[fc] - A[k 1]). 

G x [k] = T x [k] + rj (Y - T x [k] - BT A [k]). 
G A [k] = T A [k] + ^R (Y - T x [k] - RT A [k]). 
USV = svd(G x [k]), X[fc + 1] = US v[k]/Lf (S)V'. 



A[k + l]=S Xv[k]/Lf (G A [k]). 



t[k + l}= 1 + v/4t 2 [fc] + 1 /2 
+ 1] = max{t)^[A;], z/} 

fc <- fc + 1 
end while 
return X[fc], A[k] 



drops below some prescribed tolerance, i.e., ||Z[A; + 1]||f < tol x max(l, L^||X[fc] \\f)- As detailed in [36], 
the quantity \\2i\k + 1] \\f upper bounds the distance between the origin and the set of subgradients of the 
cost in (P2), evaluated at S[k + 1], 

Before concluding this section, it is worth noting that Algorithm 1 has good convergence performance, 
and quantifiable iteration complexity as asserted in the following proposition adapted from [2], [24]. 

Proposition 2: [24] Let h{.) and {A,X} denote, respectively, the cost and an optimal solution of (P2) 
when v := v. For k > ko '■= » the iterates {A[k], X[fc]} generated by Algorithm 1 satisfy 

|MA W ,XW) - „(A.x)| < W^JXN-^) 



B. Alternating-direction method of multipliers (AD-MoM) algorithm 

The AD-MoM is an iterative augmented Lagrangian method especially well-suited for parallel process- 
ing [4], which has been proven successful to tackle the optimization tasks encountered e.g., in statistical 
learning problems [27], [7]. While the AD-MoM could be directly applied to (PI), R couples the entries of 
A and it turns out this yields more difficult £i-norm minimization subproblems per iteration. To overcome 
this challenge, a common technique is to introduce an auxiliary (decoupling) variable B, and formulate 
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the following optimization problem 

(P3) min IIXL + All Alh 

{X,A,B} 

s.to Y = X + RB (31) 
B = A (32) 

which is equivalent to (PI). To tackle (P3), associate Lagrange multipliers M and M with the con- 
straints (31) and (32), respectively. Next, introduce the quadratically augmented Lagrangian function 

£(X, A, B, M, M) =||X||» + A||A||i + (M,B - A) + (M, Y - X - RB) 



+ |||Y-X-RB||| + |||A-B||| (33) 



where c is a positive penalty coefficient. Splitting the primal variables into two groups {X, A} and {B}, 
the AD-MoM solver entails an iterative procedure comprising three steps per iteration k = 1,2,... 
[SI] Update dual variables: 

M[Jfe] = M[k - 1] + c(B[ife] - A[k]) (34) 

M[Jfe] =M[k- 1] + c(Y-X[fc] -RB[Jfe]) (35) 

[S2] Update first group of primal variables: 

X[k + 1] = arg min || || Y - X - BB[k]\\ 2 F - (M[Ar],X) + ||X||»} . (36) 

A[fc + l]=arg rmn{|||A-B[fc]|||-(M[A;],A) + A||A||i}. (37) 

[S3] Update second group of primal variables: 

B[fc+1] = arg min |-||Y - X[jfe + 1] - RB||| + -\\A[k + 1] - B||| - (R'M[fc] - M[jb],B) j 
B 12 2 J 

(38) 

This three-step procedure implements a block-coordinate descent on the augmented Lagrangian, with dual 
variable updates. The minimization (36) can be recast as (28), hence X[fc + 1] is iteratively updated through 
singular value thresholding. Likewise, (37) can be put in the form (29) and the entries of A [A; + 1] are 
updated via parallel soft-thresholding operations. Finally, (38) is a strictly convex unconstrained quadratic 
program, whose closed-form solution is obtained as the root of the linear equation corresponding to 
the first-order condition for optimality. The AD-MoM solver is tabulated under Algorithm 2. Suitable 
termination criteria are suggested in [7, p. 18]. 
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Algorithm 2 : AD-MoM solver for (PI) 



input Y, R, A, c 

initialize X[0] = M[-l] = LxT , A[0] = B[0] = M[-l] = FxT , and set k = 0. 
while not converged do 

[SI] Update dual variables: 

M[k] = M[k - 1] + c(B[k] - A[k]) 

M[k] = M[k - 1] + c(Y - X[fc] - RB[fc]) 

[S2] Update first group of primal variables: 

UEV = svd(Y - RA[fc] + c^M^]), X[fc + 1] = U«S 1/c (S)V. 

A[fe + 1] = c^SxiMik] + cB[k}). 

[S3] Update second group of primal variables: 

B[k + 1] = A[fc + 1] + (R'R + If)- 1 [R'(Y - X[A + 1] - RA[ife + 1]) - c _1 (M[fe] -R'M[fc]) 



k <- k + 1 
end while 
return A[k],X[k] 



Conceivably, F can be quite large, thus inverting the F x F matrix R'R + Ip to update B[fe + 1] 
could be complex computationally. Fortunately, the inversion needs to be carried out once, and can be 
performed and cached off-line. In addition, to reduce the inversion cost, the SVD of the compression matrix 
R = U^SrV^j can be obtained first, and the matrix inversion lemma can be subsequently employed 
to obtain [R'R + I^" 1 = [I L - V R CV' R ] , where C := diag (t^> Tq^r) and P = rank(R) <C F. 
Finally, note that the AD-MoM algorithm converges to the global optimum of the convex program (PI) 
as stated in the next proposition. 

Proposition 3: [4] For any value of the penalty coefficient c > 0, the iterates {X[fc], A [A;]} converge to 
the optimal solution of (PI) as k — )• oo. 

Remark 7 (Trade-off between stability and convergence rate): The APG algorithm exhibits a conver- 
gence rate guarantee of 0(l/k 2 ) [29], while AD-MoM only attains 0(l/k) [20]. For the problem con- 
sidered here, APG needs an appropriate continuation technique to achieve the predicted performance [24]. 
Extensive numerical tests with Algorithm 1 suggest that the convergence rate can vary considerably for 
different choices e.g., of the matrix R. The AD-MoM algorithm on the other hand exhibits less variability 
in terms of performance, and only requires tuning c. It is also better suited for the constrained formulation 
(PI), since it does not need to resort to a relaxation. 
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Fig. 1. Relative error e r := ||Ao — A||f/|| Ao \\f for various values of r and s where L = 105, F = 210, and T = 420. 
White represents exact recovery (e r ~ 0), while black represents e r ~ 1, 

VII. Performance Evaluation 
The performance of (PI) is assessed in this section via computer simulations. 

A. Exact recovery 

Data matrices are generated according to Y = Xo + V^Arj. The low-rank component Xo is generated 
from the bilinear factorization model Xo = WZ', where W and Z are L x r and T x r matrices with 
i.i.d. entries drawn from Gaussian distributions J\f(0,l/L) and J\f(0, 1/T), respectively. Every entry of 
Ao is randomly drawn from the set { — 1,0,1} with Pr(aj i j = —1) = Pr(cii,j = 1) = it/2. The columns 
of V# £ R FxL comprise the right singular vectors of the random matrix R = UrS^V^, with i.i.d. 
Bernoulli entries with parameter 1/2 (cf. Remark 2). The dimensions are L = 105, F = 210, and T = 420. 
To demonstrate that (PI) is capable of recovering the exact values of {Xo, Ao}, the optimization problem 
is solved for a wide range of values of r and s using the APG algorithm (cf. Algorithm 1). 

Let A denote the solution of (PI) for a suitable value of A. Fig. 1 depicts the relative error in recovering 
Ao, namely ||A— Ao||f/||Ao||f for various values of r and s. It is apparent that (PI) succeeds in recovering 
Ao for sufficiently sparse Ao and low-rank Xo from the observed data Y. Interestingly, in cases such 
as s = 0.1 x FT or r = 0.3 x min(L,T) there is hope for recovery. In this example, one can exactly 
recover {Xo, Ao} when s = 0.0127 x FT and r = 0.2381 x min(L,T). A similar trend is observed for 
the recovery of Xo, and the corresponding plot is omitted to avoid unnecessary repetition. For different 
sizes of the matrix R, performance results averaged over ten realizations of the experiment are listed in 
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TABLE I 

Recovery performance by varying the size of R when r = 10 and tt = 0.05. 



L 


rank(Xo) 


IIAollo 


rank(X) 


l|A||o 


||A-A ||f/||Ao||f 


F 


10 


4410 


10 


4419 


2.0809 x 10" 6 


F/2 


10 


4410 


10 


4407 


6.4085 x 10" 5 


F/3 


10 


4410 


10 


9365 


7.76 x 10 -2 


F/5 


10 


4410 


14 


14690 


6.331 x 10 _1 



TABLE II 

Performance comparison of LS-PCP and Algorithm 1 averaged over ten random realizations 



Algorithm 


r = 5, tt = 0.01 


r = 5, tt = 0.05 


r = 10, tt = 0.01 


r = 10, tt = 0.05 


LS-PCP 


0.6901 


0.6975 


0.7001 


0.7023 


Algorithm 1 


7.81 x 10 -6 


3.037 x 10" 5 


1.69 x 10~ 5 


6.4 x 10~ 5 



Table I. The smaller the compression ratio L/F becomes, less observations are available and performance 
degrades accordingly. In particular, the error performanc e degrades significantly for a challenging instance 
where L/F = 0.2 and r = 0.4 x min(L, F) (cf. the last row of Table I). 

The results of [10] and [14] assert that exact recovery of {Xo, Ao} from the observations Y = Xo + Ao 
is possible under some technical conditions. Even though the algorithms therein are not directly applicable 
here due to the presence of R, one may still consider applying PCP after suitable pre-processing of Y. 
One possible approach is to find the LS estimate of the superposition Xo + Ao as Y = R^Y, and then 
feed a PCP algorithm with Y to obtain {Xo, Ao}. Comparisons between (PI) and the aforesaid two-step 
procedure are summarized in Table II. It is apparent that the heuristic performs very poorly, which is 
mainly due to the null space of matrix R (when F = 2L) that renders LS estimation inaccurate. 

B. Unveiling network anomalies via sparsity and low rank 

In the backbone of large-scale networks, origin-to-destination (OD) traffic flows experience abrupt 
changes which can result in congestion, and limit the quality of service provisioning of the end users. These 
so-termed traffic volume anomalies could be due to external sources such as network failures, denial of 
service attacks, or, intruders which hijack the network services [35], [23], [39]. Unveiling such anomalies is 
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a crucial task towards engineering network traffic. This is a challenging task however, since the available 
data are usually high-dimensional noisy link-load measurements, which comprise the superposition of 
unobser\>able OD flows as explained next. 

Consider a backbone network with topology represented by the directed graph G(N,C), where C and 
M denote the set of links and nodes (routers) of cardinality \C\ = L and \M\ = N, respectively. The 
network transports F end-to-end flows associated with specific OD pairs. For backbone networks, the 
number of network layer flows is typically much larger than the number of physical links (F S> L). 
Single-path routing is considered here to send the traffic flow from a source to its intended destination. 
Accordingly, for a particular flow multiple links connecting the corresponding OD pair are chosen to carry 
the traffic. Sparing details that can be found in [25], the traffic Y := [yi ;t ] £ M LxT carried over links 
I € C and measured at time instants t £ [1,T], can be compactly expressed as 



where the fat routing matrix R := [rej] G {0, l} LxF is fixed and given, Z := [zf t] denotes the unknown 
'clean' traffic flows over the time horizon of interest, A := [ajj] collects the traffic volume anomalies 
across flows and time, and E := [en] captures measurement errors. 

Common temporal patterns among the traffic flows in addition to their periodic behavior, render most 
rows (respectively columns) of Z linearly dependent, and thus Z typically has low rank [23], [32]. 
Anomalies are expected to occur sporadically over time, and only last for short periods relative to the 
(possibly long) measurement interval [1, T]. In addition, only a small fraction of the flows are supposed 
to be anomalous at any given time instant. This renders the anomaly matrix A sparse across rows and 
columns. Given link measurements Y and the routing matrix R, the goal is to estimate A by capitalizing 
on the sparsity of A and the low-rank property of Z. Since the primary goal is to recover A, define 
X := RZ which inherits the low-rank property from Z, and consider 



which is identical to (1) modulo small measurement errors in E G R LxT . If E = LxT , then (PI) can be 
used to unveil network anomalies, whereas (P2) is more suitable for a noisy setting. 
Remark 8 (Distributed algorithms): Implementing Algorithms 1 and 2 presumes that network nodes 
communicate their local link traffic measurements to a central processing unit, which uses their aggregation 
in Y to determine network anomalies. Collecting all this information can be challenging due to excessive 
protocol overhead, or, may be even impossible in e.g., wireless sensor networks operating under stringent 
power budget constraints. Performing the optimization in a centralized fashion raises robustness concerns 



Y = R(Z + A) + E 



(39) 



Y = X + RA + E 



(40) 
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Fig. 2. Network topology graph. 




(a) (b) 

Fig. 3. Performance for synthetic data, (a) ROC curves of the proposed versus the PCA-based method with n = 0.001, r = 10 
and o = 0.1. (b) Amplitude of the true and estimated anomalies for Pf = 10 -4 and Po = 0.97. Lines with open and filled 
circle markers denote the true and estimated anomalies, respectively. 

as well, since the central node carrying out the specific task at hand represents an isolated point of 
failure. These reasons motivate devising fully-distributed algorithms for unveiling anomalies in large scale 
networks, whereby each node carries out simple computational tasks locally, relying only on its local 
measurements and messages exchanged with its directly connected neighbors. This is the subject dealt 
with in an algorithmic companion paper [26], which puts forth a general framework for in-network sparsity- 
regularized rank minimization. 

Synthetic network data. A network of N = 20 agents is considered as a realization of the random 
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geometric graph model, that is, agents are randomly placed on the unit square and two agents communicate 
with each other if their Euclidean distance is less than a prescribed communication range of 0.35; see 
Fig. 2. The network graph is bidirectional and comprises L = 106 links, and F = N(N — 1) = 380 OD 
flows. For each candidate OD pair, minimum hop count routing is considered to form the routing matrix 
R. With r = 10, matrices {Xo, Ao} are generated as explained in Section VII-A. With reference to (39), 
the entries of E are i.i.d., zero-mean, Gaussian with variance a 2 , i.e., eij ~ yV(0,o" 2 ). 
Real network data. Real data including OD flow traffic levels are collected from the operation of the 
Internet2 network (Internet backbone network across USA) [1]. OD flow traffic levels are recorded for a 
three-week operation of Internet2 during Dec. 8-28, 2008 [23]. Internet2 comprises N = 11 nodes, L = 41 
links, and F = 121 flows. Given the OD flow traffic measurements, the link loads in Y are obtained 
through multiplication with the Internet2 routing matrix [1]. Even though Y is 'constructed' here from flow 
measurements, link loads can be typically acquired from simple network management protocol (SNMP) 
traces [35]. The available OD flows are a superposition of 'clean' and anomalous traffic, i.e., the sum of 
unknown 'ground-truth' low-rank and sparse matrices Xo + Ao adhering to (39) when R = 1^. Therefore, 
PCP is applied first to obtain an estimate of the 'ground-truth' {Xo, Ao}. The estimated Xo exhibits three 
dominant singular values, confirming the low-rank property of Xo- 

Comparison with the PCA-based method. To highlight the merits of the proposed anomaly detection 
algorithm, its performance is compared with the workhorse PCA-based approach of [23]. The crux of 
this method is that the anomaly-free data is expected to be low-rank, whereas the presence of anomalies 
considerably increases the rank of Y. PCA requires a priori knowledge of the rank of the anomaly-free 
traffic matrix, and is unable to identify anomalous flows, i.e., the scope of [23] is limited to a single 
anomalous flow per time slot. Different from [23], the developed framework here enables identifying 
multiple anomalous flows per time instant. To assess performance, the detection rate will be used as figure 
of merit, which measures the algorithm's success in identifying anomalies across both flows and time. 

For the synthetic data case, ROC curves are depicted in Fig. 3 (a), for different values of the rank 
required to run the PCA-based method. It is apparent that the proposed scheme detects accurately the 
anomalies, even at low false alarm rates. For the particular case of Pp = 10~ 4 and P D = 0.97, Fig. 3 
(b) illustrates the magnitude of the true and estimated anomalies across flows and time. Similar results 
are depicted for the Intemet2 data in Fig. 4, where it is also apparent that the proposed method markedly 
outperforms PCA in terms of detection performance. For an instance of Pp = 0.04 and Pp> = 0.93, Fig. 
4 (b) shows the effectiveness of the proposed algorithm in terms of unveiling the anomalous flows and 
time instants. 
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(a) (b) 

Fig. 4. Performance for Internet2 network data, (a) ROC curves of the proposed versus the PCA-based method, (b) Amplitude 
of the true and estimated anomalies for Pf = 0.04 and Pd = 0.93. Lines with open and filled circle markers denote the true 
and estimated anomalies, respectively. 



VIII. Closing Comments 

This paper deals with recovery of low-rank plus compressed sparse matrices via convex optimization. 
The corresponding task arises with network traffic monitoring, brain activity detection from undersampled 
fMRI, and video surveillance tasks, while it encompasses compressive sampling and principal components 
pursuit. To estimate the unknowns, a convex optimization program is formulated that mininimizes a trade- 
off between the nuclear and ^i-norm of the low-rank and sparse components, respectively, subject to 
a data modeling constraint. A deterministic approach is adopted to characterize local identifiability and 
sufficient conditions for exact recovery via the aforementioned convex program. Intuitively, the obtained 
conditions require: i) incoherent, sufficiently low-rank and sparse components; and ii) a compression 
matrix that behaves like an isometry when operating on sparse vectors. Because these conditions are in 
general NP-hard to check, it is shown that matrices drawn from certain random ensembles can be recovered 
with high probability. First-order iterative algorithms are developed to solve the nonsmooth optimization 
problem, which converge to the globally optimal solution with quantifiable complexity. Numerical tests 
with synthetic and real network data corroborate the effectiveness of the novel approach in unveiling traffic 
anomalies across flows and time. 

One can envision several extensions to this work, which provide new and challenging directions for 
future research. For instance, it seems that the requirement of an orthonormal compression matrix is only a 
restriction imposed by the method of proof utilized here. There should be room for tightening the bounds 
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used in the process of constructing the dual certificate, and hence obtain milder conditions for exact 
recovery. It would also be interesting to study stability of the proposed estimator in the presence of noise 
and missing data. In addition, one is naturally tempted to search for a broader class of matrices satisfying 
the exact recovery conditions, including e.g., non block-diagonal and binary routing (compression) matrices 
arising with the network anomaly detection task. 

Appendix 

A. Proof of Lemma 2: Suppose {Xo, Ao} is an optimal solution of (PI). For the nuclear norm and the 
£i-norm at point {Xo, Ao} pick the subgradients UV' + Wo and sign(Ao) + Fo, respectively, satisfying 
the optimality condition 

Asign(Ao) + AF = R'(UV' + W). (41) 

Consider a feasible solution {Xo + RH, Ao — H} for arbitrary nonzero H. The subgradient inequality 
yields 

||X + RH||* + A|| A - H|| >||X ||* + A|| A ||i + (UV' + W , RH) - A(sgn(A ) + F , H) . 

" „ ' 

:=<p(H) 

To guarantee uniqueness, y(H) must be positive. Rearranging terms one obtains 

p(H) = (W , RH) - A(F , H) + (RW - Asign(A ), H). (42) 

The value of Wo can be chosen such that (Wo, RH) = \\V$± (RH)||*. This is because, ||'P$j.(RH)|| j|< = 

su P||W||<i l(W,7V(RH)}|, thus there exists a w such that (^ > *- L (W),RH) = ||7V(RH)||*. One 
can then choose W := 7V(W) since ||7V(W)|| < ||W|| < 1 and Vq>(W ) = LxT . Similarly, if 
one selects Fo := — Vq-l (sign(H)), which satisfies Vq(Fq) = OfxT and ||Fo lloo = 1> then (Fo,H) = 
— ||^'n- L (H)||i. Now, using (41), equation (42) is expressed as 

<p(H) = ||^(RH)|| +A||P n x(H)|| + (AF-RW,H). 

From the triangle inequality |(AF - RW,H)| < A|(F,H)| + |(R'W,H)|, it thus follows that 

y>(H) > (ll^(RH)H, - |(R'W,H)|) + A(||^(H)||i - |(F,H)|) . (43) 

Since 7V(W) = W, it is deduced that |(W,RH)| = |(W,7V(RH))| < ||W||||P$x(RH)||*. 
Likewise, ?V(F) = F yields |(F,H)| = |(F,P^(H))| < ||F || oo 1 (H) || i . As a result 

<p(H) >(l-||W||)||^(RH)||* + A(l-||F|| 0O )||T' n x(H)|| 1 

> (l-max{||W||,||F|| 00 }){||^(RH)||, + A||^(H)|| 1 }. (44) 



IEEE TRANSACTIONS ON INFORMATION THEORY (SUBMITTED) 



30 



Now, if ||W|| < 1 and HFl^ < 1, since $ n £l R = {0 LxT } and RH / LxT , VH G ^\{0 Fx t}, there 
is no H £ fi for which RH £ <3?, and therefore, <^(H) > 0. 

Since W and F are related through (41), upon defining T := R'(UV + W), which is indeed the dual 
variable for (PI), one can arrive at conditions Cl)-C4). ■ 
B. Proof of Lemma 3: To establish that the rows of Aq are linearly independent, it suffices to show that 
||A / vec(H)|| > 0, for all nonzero H € SI. It is then possible to 

||A'vec(H)|| = ||(I - Py) ® (I - P f/ )Rvec(H)|| = ||(I - Pt/)RH(I - P V )\\ F 
= ||7V(RH)Hf = ||RH-7 7 $(RH)|| i r 

(a) (b) 

> \\RH\\ F - \\V*(BH)\\ F > \\mi\\ F (l- fi(Sl R ,<Z>)) (45) 

where (a) follows from the triangle inequality, and (b) from (3). The assumption #fc(R) < 1 along with the 

fact that no column of H has more than k nonzero elements, imply that RH ^ OlxT- Since fi(Sl r , <E>) < 1 

by assumption, the claim follows from (45). 

To arrive at the desired bound on cr m i n (Afj), recall the definition of the minimum singular value [21] 

||A'vec(H)|| ||(I-P [7 )RH(I-P y )|| F 
cr min (A^) = mm — = mm — — 

HeS]\{o FxT } ||vec(H)|| Hen\{o FxT } W^-Wf 

H min lB£ x l|P r (RH)|| F .j. el/ /2 |py(Z)||, 

He^\{0j, xr } ||H||_f ||RH||f zen R \{o LxT } \\Z\\f 

= c 1 / 2 (l-5,(R)) 1/2 min l|Z ~„^ (Z)l|F 
zen r \{o PXT } ||Z|| F 



> c vl \\ - 6 k CR)) l/2 1 - max 

^^(l-^R)) 1 / 2 ^-^,^))- 

In obtaining (c), the assumption <5fc(R) < 1 along with the fact that no column of H has more than k 
nonzero elements was used to ensure that RH ^ OlxT- In addition, (d) and (f) follow from the definitions 
(4) and (3), respectively, while (e) follows from the triangle inequality. ■ 
C. Proof of Lemma 4: Towards establishing the first bound, from the submultiplicative property of the 
spectral norm one obtains 

IIQH = ||A n xA' n (A^A^ 1 || < ||A^||||A' n (AnA^)" 1 ||. (46) 

Next, upper bounds are derived for both factors on the right-hand side of (46). First, using the fact that 
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A' A = A' u Aq + A' n± A n ± one arrives at 

^iA fi ix _ x'(A'A - A'^Aa)x 



\Aq± II 2 = max " — = max ■ 

x^O || x || 



x'A'Ax 



< max ^P-^P - min ~ J .| n M "~ = || A|| 2 - a 2 min (A' n ). (47) 



x^O ||x|| 2 x^O 

Note that A^ (AqA'q) -1 is the pseudo-inverse of the full row rank matrix Aq (cf. Lemma 3), and thus 
||Aq (AoAy" 1 || = o-^(A' n ) [21]. Substituting these two bounds into (46) yields 

"^(^"'^{(J^b)'- 1 }'- <48) 



In addition, it holds that 



|A|| 2 = A max {(I - Py) ® R(I - P C7 )R} 

= A max {(I - Py)} x A max {R'(I - Pu)R} 

( = } HR^I-Pf/)!! 2 ^!. (49) 



where in (a) and (b) it was used that the rows of R are orthonormal, and the maximum singular value of 
a projection matrix is one. Substituting (49) and the bound of Lemma 3 into (48), leads to (4). 
In order to prove the second bound, first suppose that ||I — AnA^Hc^ < 1. Then, one can write 



l A f^ A n ( A ^ A n) 1 lloo.oc — H A n-L A nlloo,oo|| ( A n A n) 



< II Aq± Aq||oo,oo|| (l — (I — A f2 A o)) ||oo,oo 

< II A n^ A nlloo,oo 

1 — ||I AqAqIIoq^ 

In what follows, separate upper bounds are derived for ||Aqj- AqIIo^oo and ||I— AqAqHoo^. For notational 
convenience introduce S := supp(Ao) (resp. S denotes the set complement). Starting with the numerator 
in the right-hand side of (50) 

HAqxAqHoo so =max||e-A n -LA' n || 1 = max V" |(e-A a _L,e' fc An)| 

k 

= max I ( e j A > e £ A ) I = m ?' x I ( AA ' e i ) e i) I 

3 e 3 e 

= max £ |(R(I-Pc / )Re il e; 2 (I-Py),e, 1 e^)| 

= max. V |(Re jl e; 2 (I-Py),(I-P u )Re 4 e / ^)|. (51) 

fei2)&s — ' 

:=9Uij2A,fo) 
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Following some manipulations, the term inside the summation can be further bounded as 

g(iuh,h,h) = |(Re jie ; 2 , (I - P f/ )Re, 1 e^ 2 ) - (Re^Py, (I - P^Re^J 
= Ke^e^R'Qt - Pc/)Re^) - (e^Pve^R'a - P^Re/J 
= le^R'CI - P C7 )Re, 1 l {j2= , 2} - (e^Pye^^R'^ - P C7 )Re, 1 )|. (52) 
Upon defining x ilA := e^R'(I - P C /)Re^ 1 and y j2A := (e^Pye^), squaring 5 gives rise to 

fUi,h,li,£2) = x 2 jiA t [h=l2} + yl A x 2 jlA - 2y hA x 2 jiA t {j2=e2} . (53) 
Since yj 2 ,£ 2 ^-{j 2 =£ 2 \ = \\^ > v e j 2 \\ 2 ^-{j 2 =e 2 } > 0, one can ignore the third summand in (53) to arrive at 

g(ji,h,ti,l2) < x jlA [^{j 2 =i 2 } + yl,£ 2 } 1/2 - (54) 

Towards bounding the scalars Xj u e 1 and ?/j 2 / 2 , rewrite Xj u e 1 := e^R'Re^ — e'^R'PyRe^. If j\ = t\, 
it holds that Xj u e 1 < HRe^JI 2 < c(l + 5i(R)); otherwise, 

x jlA < le^R'Re^J + (e^R'Pf/Re^ | < c0i,i(R) + c(l + <fi(R)) 7 £(U). 

Moreover, y j2i f 2 < ||Pyej 2 || ||Pve^ 2 || < 7 2 (V). Plugging the bounds into (54) yields 

9(jij2,hj2) < [c(l + *i(R))l b - 1=€l} + c(0 M (R) + c(l + <5i(R)) 7 |(U))l 0l#M ] 

x[l 02=M +7 4 (V)] 1 / 2 . (55) 
Plugging (55) into (51) one arrives at 
IIAo-l A'nllocoo < cfv^fc + s 7 2 (V)]0i,i(R) + c(l + 5i(R)) [fc 7 2 (V) + ^ 7 |(U) + s 7 |(U) 7 2 (V) 

:= cw max (56) 

after using: i) S n 5 = and consequently 72 / ^2 when ji = £1; and ii) 7 (V) < 1. 
Moving on, consider bounding ||I — AnA^||oo )00 that can be rewritten as 

||I - AnAn'||oo,oo = max 1 1 e/ (I - AqAq')I|i 

% 

= maxi |1- ||e/A n || 2 | + ^|(e i / An,e fc / A n )| 



max < 

i=ii+j2 
0'ij2)e5 



ll-HAe.-fl+^KAe^Ae^l \. (57) 
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In the sequel, an upper bound is derived for (57). Let (ji, j'2) denote the element of S associated with j 
in (57). For the first summand inside the curly brackets in (57), consider lower bounding the norm of the 
j-th row of A as 

||A'e,|| = \\(I-P u )Bj Bjl e , ja (I--Pv)\\ F = ||7V(Re,- e' h )\\ F 

= HRe^e^ - ^(Re Jie ; 2 )|| F > HRe^-J - ^(Re^)^ 

> HRe^jKi - /x($, n R )) > c V2(i _ 5 1 (r)) 1 /2 ( i _ Qr)) _ 

Since Si(R) < 1 and /j,($,Qr) < 1, one obtains |1 - ||A'e.,-|| 2 | < 1 - c(l - 5i(R))(l - ^($,Or)) 2 . 

For the second summand inside the curly brackets in (57), a procedure similar to the one used for 
bounding || A^j. Af/||oo,oo is pursued. First, observe that 

J2 \{AA'e jt e e )\ = £ |((I - Py) ® R(I - P C /)Re j , e e )\ 

|<R'(I - P^)Re iie ^(I - Pv), e^e^l 

(*iA)eS\{(jij a )} 

£ KRe J1 e; 2 (I-P y ),(I-P c/ )Re, 1 e', 2 )| (58) 

(^iA)e5\{(j 1 ,j 2 )} 

to deduce that, up to a summand corresponding to the index pair (j'1,,7'2), (58) is identical to the summation 
in (51). Following similar arguments to those leading to (55), one arrives at 

max y |(A'ej, A'e^l < cw max . 

3=3i+32 ±f. 
(jija)65 ^ 

Putting pieces together, (57) is bounded as 

||I- A n A' n <l-c(l-<5 1 (R))(l -^($,^)) 2 + cw max . (59) 

Note that because of the assumption w max < (1 — <5i(R))(l — /i(3>, ^_r)) 2 , ||I — AqAqHoo^ < 1 as 
supposed at the beginning of the proof. Substituting (56) and (59) into (50) yields the desired bound. ■ 
D. Proof of Lemma 6: The proof bears some resemblance with those available for the matrix completion 
problem [9], and PCP [10]. However, presence of the compression matrix R gives rise to unique challenges 
in some stages of the proof, which necessitate special treatment. In what follows, emphasis is placed on 
the distinct arguments required by the setting here. 

The main idea is to obtain first an upper bound on the norm of the linear operator 7r _1 'P$R'PnR / 'P$ — 
■p$, which is then utilized to upper bound = ||7 ? $R.7 ? q||. The former is established in the next 

lemma; see Appendix E for a proof. 



IEEE TRANSACTIONS ON INFORMATION THEORY (SUBMITTED) 34 

Lemma 10: Suppose S := supp(Ao) is drawn according to the Bernoulli model with parameter tt. 



Let A := y c(l + 5i(R))[7^(U) +7 2 (V)], and n := max{L, F}. Then, there are positive numerical 
constants C and r such that 



7r 1 ||7 ? $R'PnR/7 ? * -7rP$|| < Cy ° g ^^ +rAlog(ra) (60) 

w/f/i probability higher than 1 — O (n~ ClT ), provided that the right-hand side is less than one. 
Building on (60), it follows that 

(a) 

HT^RT^R'^H - tt < H^RPnR'^ll 

< T^RT^R'T^ - ttV^\\ 



< C^Jn\og{LF) + r^Alog(n) (61) 

where (a) and (b) come from \\V$\\ < 1 and the triangle inequality, respectively. In addition, 

\\Vn(R!V*(X))f F = \(Vn(R'V*(X)),Vn(R'V*(X)))\ 
= \(V$(R(Pn(R!Pz(X)))),X)\ 

< \\V*(R(Vn(R'V*{X))))\\ F \\X\\ F (62) 

for all X S M LxF . Recalling the definition of the operator norm, it follows from (62) that //(<&, Or) < 
v/c-^l - £ fc (R))- 1 ||7 ? *R7 ? nR / 7 ? *|| 1/2 - Plugging the bound (61), the result follows readily. ■ 
E. Proof of Lemma 10: Start by noting that 

R'7>$(X) =^(R / P$(X),e i e;.)e 4 e;. = £(X, ^(Re^))^- 

and apply the sampling operator to obtain 

7>n(R'P$(X)) = 5^6 iJ (X,P*(Re i e;))e i e; 

where {fejj} are Bernoulli-distributed i.i.d. random variables with Pr(6jj = 1) = 7r. Then, 

7^(R7^(R%(X))) = ^6 iJ (X,P $ (Re^))P $ (Re i e^). (63) 

Moreover, since RR' = I/, one finally arrives at 

7>$(X) = n(RR'P$(X)) = ^6 ij (X,7 3 $(Reie;))^(Re^). (64) 
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The next bound will also be useful later on 

||P # (R ei e/)||| = (P*(Reje/),Reie/) 

= (Pf/Re^e/ + Re,e/Py — P^Reje/Py, Re,e/) 

= (P^/Reje/, Reje/) + (Re^e/Py, Re^e/) — (Pf/Re^e/Py , Re^e/) 

= IIPc/Reie/lll + ||Re ie /Py||| - ||Pi 7 Reie i / |||.||Pve J -|||. 

< c(l + 5 1 (R)) 7 |(U) + c(l + 5i(R)) 7 2 (V) = A 2 (65) 

where (a) holds because (P^/Re^e/Py, Re^e/) = (e'jRPfyRej, e^-Pyej) and P;y = P^ (likewise Py). 
Defining the random variable 3 := 7r~ 1 ||'P$R'PnR / 'P$ — 7r"P$|| and using (64), one can write 

S = 7r" 1 sup II V(6 iJ -7r)(X,P, I) (Re i e;.))7 : ' (E) (Re 4 e;.) 

l|X|| F =l 11 y F 

= 7T _1 sup / (&i j ~~ vr)vec(X)'vec[P$(Reje / )] <g> vec['P$(Reie' )] 

l|vec(X)||=l 11 



tt" 1 ! - 7r)vec[P$(R ei ej.)] ® vecp^Re^)] 



(66) 



Random variables {bij — ir} are i.i.d. with zero mean, and thus one can utilize the spectral concentration 
inequality in [33, Lemma 3.5] to find 



E[3] < C X h^± max \\V,(Ke t e')\\ F < cJ^^A (67) 

V 7T i,j J V 7T 

for some constant C > 0, where (b) is due to (65). Now, applying Talagrand's concentration tail bound [34] 
to the random variable 3 yields 

Pr(|3-E[3]| >t)< 3exp (- tl °^ 2 K min{l,t}\ (68) 

for some constant K > 0, where t := rAlog(n) and n := m&x{L, F}. The arguments leading to (67) 
and (68) are similar those used in [9, Theorem 4.2] for the matrix completion problem, and details are 
omitted here. Putting (67) and (68) together it is possible to infer 



3 < E[S] + t < cy lQg ^ F) + rA log(n) (69) 
with probability higher than 1 — 0(n~ CnAr ), which completes the proof of the lemma. ■ 
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